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Trainable Nonlinear Reaction Diffusion: A 
Flexible Framework for Fast and Effective 

Image Restoration 

Yunjin Chen and Thomas Pock 


Abstract —Image restoration is a long-standing problem in low-level computer vision with many interesting applications. We describe a 
flexible learning framework based on the concept of nonlinear reaction diffusion models for various image restoration problems. By 
embodying recent improvements in nonlinear diffusion models, we propose a dynamic nonlinear reaction diffusion model with 
time-dependent parameters (/.e., linear filters and influence functions). In contrast to previous nonlinear diffusion models, all the 
parameters, including the filters and the influence functions, are simultaneously learned from training data through a loss based 
approach. We call this approach TNRD - Trainable Nonlinear Reaction Diffusion. The TNRD approach is applicable for a variety of 
image restoration tasks by incorporating appropriate reaction force. We demonstrate its capabilities with three representative 
applications, Gaussian image denoising, single image super resolution and JPEG deblocking. Experiments show that our trained 
nonlinear diffusion models largely benefit from the training of the parameters and finally lead to the best reported performance on 
common test datasets for the tested applications. Our trained models preserve the structural simplicity of diffusion models and take 
only a small number of diffusion steps, thus are highly efficient. Moreover, they are also well-suited for parallel computation on GPUs, 
which makes the inference procedure extremely fast. 

Index Terms —nonlinear reaction diffusion, loss specific training, image denoising, image super resolution, JPEG deblocking 
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1 Introduction 

I MAGE restoration is the process of estimating uncorrupted im¬ 
ages from noisy or blurred ones. It is one of the most fundamen¬ 
tal operations in image processing, video processing, and low-level 
computer vision. For several decades, image restoration remains 
an active research topic and hence new approaches are constantly 
emerging. There exists a huge amount of literature addressing the 
topic of image restoration problems, see for example [41] for a 
survey. 

In recent years, the predominant approaches for image restora¬ 
tion are non-local methods based on patch modeling, for example, 
image denoising with (i) Gaussian noise [40], [26], [15], [46], 
(ii) multiplicative noise [14], or (hi) Poisson noise [24], image 
interpolation [47], image deconvolution [20], etc. Most state-of- 
the-art techniques mainly concentrate on achieving utmost image 
restoration quality, with little consideration on the computational 
efficiency, e.g., [40], [26], [47], despite the fact that it is a critical 
factor for real applications. However, there are a few exceptions. 
For example, there are two notable exceptions for the task of Gaus¬ 
sian denoising, BM3D [15] and the recently proposed Cascade of 
Shrinkage Fields (CSF) [52] model, which simultaneously offer 
high efficiency and high image restoration quality. 

It is well-known that BM3D is a highly engineered Gaussian 


• Y.J. Chen is with the Institute for Computer Graphics and Vision, Graz 
University of Technology, 8010 Graz, Austria. 

E-mail: chenyunjin_nudt@hotmail. com 

• T. Pock is with the Institute for Computer Graphics and Vision, Graz 
University of Technology, 8010 Graz, Austria, as well as Digital Safety 
& Security Department, AIT Austrian Institute of Technology GmbH, 1220 
Vienna, Austria. E-mail: pock@icg.tugraz.at 

This work was supported by the Austrian Science Eund (EWE) under the 
China Scholarship Council (CSC) Scholarship Program and the START 
project BIVISION, No. Y729. 


image denoising algorithm. It involves a block matching process, 
which is challenging for parallel computation on GPUs, alluding 
to the fact that it is not straightforward to accelerate BM3D 
algorithm on parallel architectures. In contrast, the recently pro¬ 
posed CSF model offers high levels of parallelism, making it well 
suited for GPU implementation, thus owning high computational 
efficiency. 

In this paper, we propose a flexible learning framework to gen¬ 
erate fast and effective models for a variety of image restoration 
problems. Our approach is based on learning optimal nonlinear 
reaction diffusion models. The learned models preserve the struc¬ 
tural simplicity of these models and hence it is straightforward 
to implement the corresponding algorithms on massive parallel 
hardware such as GPUs. 

1.1 Nonlinear diffusion for image restoration 

Partial differential equation (PDFs) have become a standard ap¬ 
proach for various problems in image processing. On the one hand 
they come along with a sound mathematical framework that allow 
to make clear statements about the existence and regularity of the 
solutions. On the other hand, efficient numerical algorithms have 
been developed, that allow to compute the solution of PDFs in 
very short time [45], [55] . Although recent PDF approaches have 
shown good performance for a number of image processing task, 
they still fail to produce state-of-the-art quality for classical image 
restoration tasks. 

In the seminal work [45], Perona and Malik (PM) proposed a 
nonlinear diffusion model, which is given as the following PDF 

r ^ = div( 5 (|V«|)Vn) 
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where V is the gradient operator, t denotes the time, / is a initial 
image. The function g is known as edge-stopping function [7] 
or diffusivity function [55], and a typical ^-function is given 
by g{z) = 1/(1 + z^). The proposed PM diffusion model (1) 
leads to a nonlinear anisotropic^ diffusion model which is able to 
preserve and enhance image edges. Hence, it is well suited for a 
number of image processing tasks such as image denoising and 
segmentation. 


1.1.1 Improvements from the side of PDEs 
A first variant of the PM model is the so-called biased anisotropic 
diffusion (also known as reaction diffusion) proposed by Nord¬ 
strom [43], which introduces a bias term (forcing term) to free the 
user from the difficulty of specifying an appropriate stopping time 
for the PM diffusion process. This additional term reacts against 
the strict smoothing effect of the pure PM diffusion, therefore 
resulting in a nontrivial steady-state. 

Subsequent works consider modifications of the diffusion or 
the reaction term for the reaction diffusion model [21], [13], 
[ ], e.g., Acton et al. [1] exploited a more complicated reaction 
term to enhance oriented textures; [4] proposed to replace the 
ordinary diffusion term with a flow equation based on mean 
curvature. A notable work is the forward-backward diffusion 
model proposed by Gilboa et al. [23], which incorporates explicit 
inverse diffusion with negative diffusivity coefficient by carefully 
choosing the diffusivity function. The resulting diffusion processes 
can adaptively switch between forward and backward diffusion 
processes. In subsequent work [56], the theoretical framework for 
discrete forward-and-backward diffusion filtering has been inves¬ 
tigated. Researchers also propose to exploit higher-order nonlinear 
diffusion filtering, which involves larger linear filters, e.g., fourth- 
order diffusion models [28], [17], [27]. Meanwhile, theoretical 
properties about the stability and local feature enhancement of 
higher-order nonlinear diffusion filtering are established in [16]. 

It should be noted that the above mentioned diffusion models 
are handcrafted, including elaborate selections of the diffusivity 
functions, optimal stopping times and proper reaction forces. It 
is a generally difficult task to design a good-performing PDE for 
a specific image processing problem, as good insights into this 
problem and a deep understanding of the behavior of the PDEs 
are usually required. Therefore, an attempt to learn PDEs from 
training data via an optimal control approach was made in [39], 
where the PDEs to be trained have the form of 


{ ^ = k(u) + a{ty 0(u) 

l^lt=o = /- 

Coefficients a(t) are free parameters {i.e., combination weights) 
to train. is related to the TV regularization [49] and 0{u) 
denotes a set of operators (invariants) over u, e.g., ||Vit||2 = 


1.1.2 Improvements from the side of image statis¬ 
tics/regularization 

As shown in [51], [43], [34], [50], there exist a strong connec¬ 
tion between anisotropic diffusion models and variational models 


adopting image priors derived from the statistics of natural images. 
Let us consider the discrete version of the PM model (1), where 
images are represented as column vectors, i.e., u G The 
discrete PM model is formulated as the following discrete PDE 
with an explicit finite difference scheme 


ut+i - Ut 

At 


i={x,y} i={x,y} 


(3) 

where matrices and G are finite difference approx¬ 

imation of the gradient operators in x-direction and ^-direction, 
respectively and At denotes the time step. A{ut) G is 

defined as a diagonal matrix 


A{ut) = diag (^g (y^ ^ , 

where function g is the edge-stopping function mentioned be¬ 
fore. If ignoring the coupled relation between and \/yU, 
the PM model can also be written in the form (j){Viu) = 
(0(Vii4)i, • • • ,0(Vii4)Ar)^ G with function 
known as influence function [ ] or fiux function [55]. In this paper, 
we stick to this discrete and decoupled formulation, as it is the 
starting point of our approach. 

As shown in previous works, e.g., [51], [59], the diffusion step 
(3) corresponds to a gradient descent step to minimize the energy 
functional given as 


N 

ie{x,y} P=l 

the functions p {e.g., p{z) = log(l + z^)) is the so-called 
penalty function.^ It is worthwhile to mention that the matrix- 
vector product can be interpreted as a 2D convolution of 
u with the linear filter kx = [—1,1] (V^ corresponds to the 
linear filter ky = [—1,1]^). The energy functional (4) can be 
also understood from the aspects of image statistics, image prior 
and image regularization. As a consequence, a lot of efforts listed 
below have been made to improve the capability of model (4). 

a) More filters of larger kernel size were considered in [59], [48], 
[10], [3], instead of relatively small kernel size, such as usually 
used pair-wise kernels. The resulting regularization model leads 
to the so-called fields of experts (EoE) [48] image prior, which 
works well for many image restoration problems. 

b) Instead of hand-crafted ones with fixed shape, more fiexible 
penalty functions were exploited, and they were learned from 
data [59], [50], [34], [52]. Especially, as shown in [59] that 
those unusual penalties such as inverted penalties {i.e., p{z) 
decreasing as a function of | 2 ;|) were found to be necessary. 

c) In order accelerate the inference phase related to the model (4), 
in [3], [18], it was proposed to truncate the gradient descent 
procedure to fixed iterations/stages, and then train this truncated 
optimization algorithm based on the EoE prior model. 

d) In order to further increase the fiexibility of multi-stage models, 
Schmidt and Roth [52] considered varying parameters per stage. 


1.2 Motivations and Contributions 


1. Anisotropic diffusion in this paper is understood in the sense that the 
smoothing induced by PDEs can be favored in some directions and prevented 
in others. The diffusivity is not necessary to be a tensor. It should be noted that 
this definition is different from Weickert’s terminology [55], where anisotropic 
diffusion always involves a diffusion tensor, and the PM model is regarded as 
an isotropic model. 


In this paper we concentrate on nonlinear diffusion process due 
to its high efficiency. Taking into consideration the improvements 
mentioned in Sec. 1.1.2, we propose a trainable nonlinear diffusion 

2. Note that p'{z) = 4>{z). 
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model with (1) fixed iterations (also referred to as stages), (2) 
more filters of larger kernel size, (3) fiexible penalties in arbitrary 
shapes, (4) varying parameters for each iteration, i.e., time varying 
linear filters and penalties. Then all the parameters {i.e., linear 
filters and penalties) in the proposed model are simultaneously 
trained from training data in a supervised way. The proposed 
approach results in a novel learning framework to train effective 
image diffusion models. It turns out that the trained diffusion 
processes leads to state-of-the-art performance, while preserve 
the property of high efficiency of diffusion based approaches. 
In summary, our proposed nonlinear diffusion process offers the 
following advantages: 

1) It is conceptually simple as it is merely a standard nonlinear 
diffusion model with trained filters and infiuence functions; 

2) It has broad applicability to a variety of image restoration 
problems. In principle, all the diffusion based models can be 
revisited with appropriate training; 

3) It yields excellent results for several tasks in image restora¬ 
tion, including Gaussian image denoising,single image super 
resolution and JPEG deblocking; 

4) It is highly computationally efficient, and well suited for 
parallel computation on GPUs. 

A shorter paper has been presented as a conference version 
[12]. In this paper, we incorporate additional contents listed as 
follows 

1) We investigate more details of the training phase, such as the 
infiuence of (a) initialization, (b) the model capacity and (c) 
the number of training samples; 

2) We consider more detailed analysis of trained models, such 
as how the trained models generate the patterns; 

3) We exploit an additional application of single image super 
resolution to further illustrate the potential breadth of our 
proposed learning framework. 


2 Proposed reaction diffusion process 

In this section, we first describe our learning based reaction 
diffusion model for image restoration, and then we show the rela¬ 
tions between the proposed model and existing image restoration 
models. 

2.1 Proposed nonlinear diffusion modei 

The fundamental idea of our proposed learning based reaction 
diffusion model is described in Sec. 1.2. In addition, we incor¬ 
porate a reaction term in order to apply our model for different 
image processing problems, as shown later. As a consequence, 
our proposed nonlinear reaction diffusion model is formulated as 

Nk 

a7~' =-^Kf (5) 

^ ^ -y reaction term 

diffusion term 

where Ki G is a highly sparse matrix, implemented as 2D 

convolution of the image u with the filter kernel ki, i.e., KiU 
ki * u, Ki is a set of linear filters and is the number of filters. 
In practice, we set At = 1, as we can freely scale the functions 
(j)i and on the right hand side. 

Note that our proposed diffusion process is truncated after 
a few stages, usually less than 10. Moreover, the linear filters 
and infiuence functions are adjustable and vary across stages. As 


our proposed approach is inspired by nonlinear reaction diffusion 
model but with trainable filters and infiuence functions, we coin 
our method Trainable Nonlinear Reaction Diffusion (TNRD). 

In practice, we train the proposed nonlinear diffusion model 
(5) for specific image restoration problem by exploiting applica¬ 
tion specific reaction terms For classical image restoration 

problems, such as Gaussian denoising, image deblurring, image 
super resolution and image inpainting, we can set the reaction 
term to be the gradient of a data term, i.e. 'ip{u) = 

For example, if we choose V^{u^f) = — /Hi. we 

have = \^A^{Au — /), where / is the degraded input 

image, A is the associated linear operator, and is related to the 
strength of the reaction term. In the case of Gaussian denoising, 
A is the identity matrix; for image super resolution, A is related 
to the down sampling operation and for image deconvolution, A 
corresponds to the linear blur kernel. 


2.2 A more general formulation of the proposed diffu¬ 
sion modei 

Note that the data term V{u) related to (5) should be dif¬ 
ferentiable. In order to handle the problems involving a non- 
differentiable data term, e.g., the JPEG deblocking problem in¬ 
vestigated in Section 6, we consider a more general form of our 
proposed diffusion model as follows 


Nk 


ut = Proxgt ut-i - 


( 6 ) 


where Prox^t (u) is the proximal mapping operation [44] related 
to the non-differentiable function given as 


Proxgt (u) = min 

u 





2.3 Reiated image restoration modeis 

As mentioned in Sec. 1.1.2, there exist natural connection between 
anisotropic diffusion and image regularization based energy func¬ 
tional. Therefore, Eq. (6) can be interpreted as a forward-backward 
step^ [37] at Uf-i for the energy functional given by 


Nk 

{u, f ) = Y.n\{u) + D*(ti, /) + a* {U, f), (7) 

i=l 

where = Ef=i w)p) are the regularizers. Since 

the parameters {Kf, pjj vary across the stages, (7) is a dynamic 
energy functional, which changes at each iteration. 

If {Kf,pj} keep the same across stages, the functional (7) 
with Q = 0 corresponds to the FoE prior regularized variational 
model for image restoration [48], [11], [10]. In our work, we do 
not exactly solve this minimization problem anymore. In contrast, 
we run the gradient descent step for several iterations, and each 
gradient descent step is optimized by training. More importantly, 
we are allowed to investigate more generalized penalties. 

To the best of our knowledge, Zhu and Mumford [59] were 
the first to consider learning reaction diffusion models from data. 
The linear filters appearing in the prior were chosen (not fully 
trained) from a general filter bank by minimizing the entropy of 
probability distributions of natural images. The associated penalty 
functions were learned based on the maximum entropy principle. 


3. The forward step is a gradient descent step w.r.t the function V and the 
backward step is the proximal operation w.r.t the function Q. 
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Fig. 1. The architecture of our proposed diffusion model with a reaction force = X^A^{Au - f) and ^ = o. It is represented as a feed-forward 
network. Note that the additional convolution step with the rotated kernels ki (cf. Equ. 15) does not appear in conventional feed-forward CNs. 


However, our proposed diffusion model is a multi-stage diffusion 
process with multiple image priors, where the filters and penalties 
are fully trained from clean/degraded pairs. 

Some more works have also been dedicated to train the 
penalties in a diffusion model. In [34], the authors trained the 
penalty functions in the way that they first computed the image 
statistics appropriate to certain chosen filters {e.g., horizontal and 
vertical image derivatives), and then considered mixture models 
of fixed shape having a peak near zero and two heavy tails in 
both sides to fit to image statistics. Analogously, in [50], potential 
functions of fixed shape are chosen to resemble zero mean filter 
responses. 

In very recent work [52], Schmidt and Roth exploited an 
additive form of half-quadratic optimization to solve problem 
(7). The resulting model shares similarities with classical wavelet 
shrinkage and hence it is termed “cascade of shrinkage fields” 
(CSF). The CSF model relies on the assumption that the data 
term in (7) is quadratic and the operator A can be interpreted as 
a convolution operation, such that the corresponding subproblem 
can be solved in closed-form using the discrete Fourier transform 
(DFT). However, our proposed diffusion model does not have this 
restriction on the data term. In principle, any smooth data term is 
appropriate. Moreover, as shown in the following sections, we can 
even handle the case of non-smooth data terms. 

As already mentioned, [3], [18] also proposed to train an 
optimized gradient descent algorithm for the energy functional 
similar to (7). However, their model is much more constrained, 
since they exploited the same filters for each gradient descent 
step and more importantly, they make use of a hand-selected 
influence function. This clearly restricts the model capability, as 
demonstrated in Sec. 4. 

Comparing our model to the model of [39] (cf. (2)), one can 
see that this approach learns only a linear model based on pre¬ 
defined image invariants. Therefore, this model can be interpreted 
as a simplified version of our model (5) with fixed linear filters and 
influence functions, and only the weight of each term is optimized. 

The proposed diffusion model also bears an interesting link 
to convolutional networks (CNs) applied to image restoration 
problems in [32]. One can see that each iteration (stage) of our 
proposed diffusion process involves convolution operations with a 
set of linear filters, and thus it can be treated as a convolutional 
network. The architecture of our proposed diffusion model is 
shown in Figure 1, where it is represented as a common feed¬ 
forward network. We refer to this network in the following as 
diffusion network. 


Reaction force 



Fig. 2. Our diffusion network can also be interpreted as a CN with a 
feedback step, which makes it different from conventional feed-forward 
networks. Due to the feedback step, it can be categorized into recurrent 
networks [25]. 

However, we can introduce a feedback step to explicitly 
illustrate the special architecture of our diffusion network that 
we subtract “something” from the input image. Therefore, our 
diffusion model can be represented in a more compact way in 
Figure 2, where one can see that the structure of our CN model 
is different from conventional feed-forward networks. Due to this 
feedback step, it can be categorized into recurrent networks [25] . 
It should be noted that the nonlinearity (i.e., infiuence functions 
in the context of nonlinear diffusion) in our proposed network are 
trainable. However, conventional CNs make use of fixed activation 
function, e.g., the ReLU function [42] or sigmoid functions [32]. 

3 Learning framework 

We train our diffusion networks in a supervised manner, namely 
we firstly prepare the input/output pairs for a certain image pro¬ 
cessing task, and then exploit a loss minimization scheme to learn 
the model parameters 0^ for each stage t of the diffusion process. 
The training dataset consists of S training samples 
where is a noisy observation and Ug^ is the corresponding 
ground truth clean image. The model parameters 0t of each stage 
include the parameters of (1) the reaction force weight A, (2) linear 
filters and (3) infiuence functions, Le., 0^ = {A^, 0^, /c|}. 

3.1 Overall training modei 

In the supervised manner, a training cost function is required to 
measure the difference between the output of the diffusion network 
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and the ground-truth image. As our goal is to train a diffusion 
network with T stages, the cost function is formulated as 

S' 

>C(©i.....t) = (8) 

S = 1 


where Ut is the output of the final stage T. In our work we exploit 
the usual quadratic loss function^, defined as 

^(wr>«st) = (9) 


As a consequence, the training task is generally formulated as 
the following optimization problem 


min>C(©) = \\\ufp — u\ 


'gt\\2 


-1- ( + 


t = 

( 10 ) 

where 0 = and /q is the initial status of the diffusion 

process. Note that the above loss function only depends on the 
output of the final stage T, i.e., the parameters in all stages 
are simultaneously trained such that the output of the diffusion 
process - Ut is optimized. We call this training scheme joint 
training similar to [52]. The joint training strategy is a min¬ 
imization problem with respect to the parameters in all stages 
{01, 02, • • • , 0t}. 

One can see that our training model is also a deep model with 
many stages (layers). It is well-known that deep models are usually 
sensitive to initialization, and therefore training from scratch is 
prone to getting stuck at bad local minima. As a consequence, 
people usually consider a greedy layer-wise pre-training [5] to 
provide a good initialization for the joint training (fine tune). 

In our work, we also consider a greedy training scheme 
similar to [52], to pre-train our diffusion network stage-by-stage, 
where each stage is greedily trained such that the output of each 
stage is optimized, i.e., for stage t, we minimize the cost function 


s 

^(©0 = ( 11 ) 

S = 1 

where ul is the output of stage t of the diffusion process. Note that 
this is a minimization problem only with respect to the parameters 
0t in stage t. 


3.2 Parameterizing the infiuence functions 0-, iinear 
fiiters kj and weights 

In this paper, we aim to investigate arbitrary infiuence functions. In 
order to conduct a fast and accurate training, an effective function 
parameterization method is required. Following the work of [52], 
we parameterize the infiuence function via standard radial basis 
functions (RBFs), i.e., each function (j) is represented as a weighted 
linear combination of a family of RBFs as follows 

J=1 V 7g / 

4. This loss function is related to the PSNR quality measure. Note that as 
shown in [33], other quality measures, such as structural similarity (SSIM) 
and mean absolute error (MAE) can be chosen to define the loss function. At 
present we only consider the quadratic loss function due to its simplicity. 


where (p represents different RBFs. In this paper, we exploit RBFs 
with equidistant centers jij and unified scaling . We investigate 
two typical RBFs [31]: (1) Gaussian radial basis cpg and (2) 
triangular-shaped radial basis cpt, given as 


and 


‘Pgi.z) = ^ I —-— I = exp 




^t{z) = ^ 



\0 \z-fi\>j 


respectively. The basis functions are shown in Figure 3, together 
with an example of the function approximation by using two 
different RBF methods. In our work, we have investigated both 
function approximation methods, and we find that they lead 
to similar results. We only present the results obtained by the 
Gaussian RBF in this paper. 

In our work, the linear kernels kj related to the linear oper¬ 
ators Kj are defined as a linear combination of Discrete Cosine 
Transform (DCT) basis kernels 6^, i.e., 

,t ^irbr 

Hh ’ 

where the kernels k^ are normalized to get rid of an ambiguity 
appearing in the proposed diffusion model. More details can be 
found in the supplemental material. The kernels are formed in this 
way in order to keep the expected property of zero-mean. 

The weights in our model are constrained to be positive. 
To this end, we set A^ ^ in the training phase for our 
implementation. 


3.3 Computing gradients 

For both greedy training and joint training, we make use of 
gradient-based algorithms {e.g., the L-BFGS algorithm [38]) for 
optimization. The key point is to compute the gradients of the 
loss function with respect to the training parameters. In greedy 
training, the gradient of the loss function at stage t with respect to 
the model parameters 0t is computed using standard chain rule, 
given as 

di.{ut.,Ugi) _ duf di{ut^Ugt^ n 

where = Uf — Ugt is directly derived from (9), 

is computed from the diffusion process for specific task. For 
the applications exploited in this paper, such as image denoising 
with Gaussian noise, single image super resolution and JPEG 
deblocking, we present the detailed derivations of in the 
supplemental material. 

In the joint training, we compute the gradients of the loss func¬ 
tion with respect to 0t by using the standard back-propagation 
technique widely used in the neural networks learning [35], 
namely, is computed by using 

d£{uT,Ugt) _ dut+i d£{uT,Ugt) 

dQt dQt dut duT 

Compared to the greedy training, we additionally need to 
calculate • For the investigated image processing problems 
in this paper, we provide all necessary derivations in the supple¬ 
mental material. 
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Fig. 3. Function approximation via Gaussian ^g{z) or triangular-shaped radial basis function, respectively for the function = 
with s = ^ - Both approximation methods use 63 basis functions equidistantly centered at [-310 : 10 : 310]. 


3.4 Experimental setup and implementation details 

3.4.1 Boundary condition of the convolution operations 

In our convolution based diffusion network, the image size stays 
the same when an image goes through the network, and we use 
the symmetric boundary condition for convolution calculation. In 
our original diffusion model ( 6 ), there is matrix transpose K~^v, 
which exactly corresponds to the convolution operation k ^ v (k 
is obtained by rotating the kernel A: 180 degrees) in the cases of 
periodic and zero-padding boundary conditions. It should be noted 
that in the case of symmetric boundary condition used in this 
paper, this result holds only in the central image region. However, 
we still want to explicitly use the formulation k ^ v to replace 
K^v, because the former can significantly simplify the derivation 
of the gradients required for training. 

We find that the direct replacement introduces some artifacts 
at the image boundary. In order to avoid these artifacts, we 
symmetrically pad the input image before it is sent to the diffusion 
network, and then we discard those padding pixels in the final 
output. More details are found in the supplemental material. 

3.4.2 RBF kernels 

Images exploited in this paper have the dynamic range in [0, 255], 
and the filters have unit norm. In order to cover most of the 
filter response, we consider infiuence functions in the range 
[—310, 310]. We use 63 Gaussian RBFs with equidistant centers 
at [—310 : 10 : 310], and set the scaling parameter 7 = 10. 

3.4.3 Experimental setup 

Model capacity: In our work, we train the proposed diffusion 
network with at most 8 stages to observe its saturation behavior 
after certain stages. We first greedily train T stages of our model 
with specific model capacity, then conduct a joint training to refine 
the parameters of the whole T stages. 

In this paper, we mainly consider four different diffusion 
networks with increasing capacity: 

TNRD|^x 3 , Fully trained model with 8 filters of size 3x3, 

TNRD^>< 5 , Fully trained model with 24 filters of size 5x5, 

TNROfxy, Fully trained model with 48 filters of size 7x7, 

TNRD^>< 9 , Fully trained model with 80 filters of size 9x9, 

where TNRD^><^ denotes a nonlinear diffusion process of stage 
T with filters of size m x m. The filters number is — 1, 
if not specified. For example, TNRDy^y model contains (48 x 
48 (filters) + 48 x 63 (penalties) -f 1 (A)) • T = 5329 • T free 
parameters. 


Training and test dataset: In order to make a fair comparison to 
previous works, we make use of the same training datasets used 
in previous works for our training, and then evaluate the trained 
models on commonly used test datasets. For image processing 
problems investigated in this paper, i.e., Gaussian denoising, single 
image super resolution and JPEG deblocking, we consider the 
following training and test datasets, respectively. 

a) Gaussian denoising. Following [52], we use the same 400 
training images, and crop a 180 x 180 region from each image, 
resulting in a total of 400 training samples of size 180 x 180, 
i.e., roughly 13 million pixels. We then evaluate the denoising 
performance of a trained model on a standard test dataset of 
68 natural images, which is suggested by [48], and later widely 
used for Gaussian denoising testing. Note that the test images 
are strictly separate from the training datasets. 

b) Single image super resolution. The publicly available frame¬ 
work of Timofte et al. [54] provides a perfect base to com¬ 
pare single image super resolution algorithms. It includes 91 
training images and two test datasets Set5 and Setl4. Many 
recent state-of-the-art learning based image super resolution 
approaches [53], [19] accomplish their comparison based on 
this framework. Therefore, we also use the same 91 training 
images. We crop 4-5 sub-images of size 150 x 150 from each 
training image, according to its size, and this finally gives us 
421 training samples. We then evaluate the performance of the 
trained models on the Set5 and Setl4 dataset. 

c) JPEG deblocking. We train the diffusion models using the same 
training images as in the case of Gaussian denoising. In the 
test phase, we follow the test procedure in [33] for performance 
evaluation. The test images are converted to gray-value, and 
scaled by a factor of 0.5, resulting 200 images of size 240 x 160. 

3.4.4 Approximate training time 

Note that the calculation of the gradients of the loss function in 
(13) is very efficient even using a simple Matlab implementation, 
since it mainly involves 2D convolutions. The training time varies 
greatly for different configurations. Important factors include (1) 
model capacity, (2) number of training samples, (3) number of 
iterations taken by the L-BFGS algorithm, and (4) number of 
Gaussian RBF kernels used for function approximation. We report 
the most time consuming cases as follows. 

In training, computing the gradients with respect to the 
parameters of one stage for 400 images of size 180 x 180 takes 
about 35s (TNRDsxs), 75s (TNRDtxt) or 165s (TNRDqxq) 
using Matlab implementation on a server with CPUs: Intel(R) 
Xeon E5-2680 @ 2.80GHz (eight parallel threads using parfor 
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(a) Noisy no (20.17) (b) Stage 1: m (27.26) (c) Stage 2: U 2 (28.40) (d) Stage 3: na (28.18) (e) Stage 4: (28.63) (f) Stage 5: ns (29.63) 

Fig. 4. An image denoising example for noise level cr = 25 to illustrate how our learned TNRDixs works, (b) - (e) are intermediate results at stage 
1 - 4, and (f) is the output of stage 5, i.e., the final denoising result. 


in Matlab, 63 Gaussian RBF kernels for the influence function 
parameterization). We typically run 200 L-BFGS iterations for 
optimization. Therefore, the total training time, e.g., for the 
TNRDf ^7 model is about 5 x (200 x 75)/3600 = 20.8/i. Code 
for learning and inference is available on the authors’ homepage 
www.GPU4Vision.org^. For the training of the Gaussian denoising 
task, we have also accomplished a GPU implementation, which is 
about 4-5 times faster than our CPU implementation. 


4 Training FOR Gaussian DENOISING 


For the task of Gaussian denoising, we consider the following 
energy functional 

Nk ^ 

mmE(u) ='^pi{ki*u) + -\\u - f\\l. 

i=l 

By setting V{u) = f ||74 — /Hi and Q{u) = 0, we arrive at the 
following diffusion process with uq = f 


Ut = Ut-l - 


Nk 




i=l 


(15) 


where we explicitly use a convolution kernel ki (obtained by 
rotating the kernel ki 180 degrees) to replace the Kj for the 
sake of model simplicity, but we have to pad the input image. 
The gradients and required in training are computed 

from this equation. Detailed derivations are presented in the 
supplemental material. 

We started with the training for TNRD|^>< 5 . We first considered 
the greedy training phase to train a diffusion process up to 8 stages 
{i.e., T < 8 ), in order to observe the asymptotic behavior of the 
diffusion network. In the greedy training, only parameters in one 
stage were trained at a time. We exploited a plain initialization 
to start the training, namely linear Alters and influence functions 
were initialized from the modified DCT Alters and the function 
(/)(z) = 2z/{1 ^ z‘^), respectively. 

After the greedy training was completed, we conducted joint 
training for a diffusion model of certain stages (e.g., T = 2,5,8), 
to simultaneously tune the parameters in all stages. We initialized 
the joint training using parameters learned in greedy training, as 
this is guaranteed not to degrade the training performance. 

We Arst trained our diffusion models for the Gaussian denois¬ 
ing problem with standard deviation a = 25. The noisy training 


5. http://gpu4vision.icg.tugraz.at/binaries/nonlinear- diffusion.zip#pub94 


images were generated by adding synthetic Gaussian noise with 
cr = 25 to the clean images. Once we obtained a trained model, 
we evaluated its denoising performance on 68 natural images 
following the same test protocol as in [52], [10]. Figure 4 shows 
a denoising example for noise level a = 25 to illustrate the 
denoising process of our learned TNRD^^g diffusion network. 

We present the Anal results of the joint training in Table 
1 , together with a selection of recent state-of-the-art denoising 
algorithms, namely BM3D [15], LSSC [40], EPLL-GMM [60], 
opt-MRF [10], RTF model [33], the CSF model [52] and WNNM 
[26], as well as two similar approaches ARF [3] and opt-GD 
[18], which also train an optimized gradient descent inference. 
We downloaded these algorithms from the corresponding author’s 
homepage, and used them as is. Unfortunately, we are not able 
to present comparisons with [39], [32], as their codes are not 
available. 

From Table 1, one can see that (1) the performance of the 
TNRD |^^5 model saturates after stage 5, i.e., in practice, 5 stages 
are typically enough; (2) our TNRD|x 5 model has achieved 
significant improvement (28.78 v^.28.60), compared to a similar 
model CSF 5 X 5 , which has the same model capacity and (3) 
moreover, our TNRD |><5 model is on par with so far the best- 
reported algorithm - WNNM. It turns out that our trained models 
perform surprisingly well for image denoising. Then, a natural 
question arises: what is the critical factor for the effectiveness of 
the trained diffusion models? 

4.1 Understanding the proposed diffusion modeis 

There are actually two main aspects in our training model: (1) 
the linear Alters and (2) the influence functions. In order to have 
a better understanding of the trained models, we went through 
a series of experiments to investigate the impact of these two 
aspects. 

Concentrating on the model capacity of 24 Alters of size 5x5, 
we considered the training of a diffusion process with 10 steps, i.e., 
T = 10 for the Gaussian denoising of noise level cr = 25. We 
exploited two main classes of configurations: (A) the parameters 
of every stage are the same and (B) every diffusion stage is 
different from each other. In both configurations, we consider two 
cases: (I) only train the linear Alters with Axed influence function 
(f){z) = 2z/{1 z‘^) and (II) simultaneously train the Alters and 

influence functions. 

Based on the same training dataset and test dataset, we ob¬ 
tained the following results: (A.I) every diffusion step is the same, 
and only the Alters are optimized with Axed influence function. 
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(c) Truncated concave 


(d) Double-well penalty 


Fig. 5. The figure shows four characteristic influence functions (left plot 
in each subfigure) together with their corresponding penalty functions 
(right plot in each subfigure), learned by our proposed method in the 
TNRDgxs model. A major finding in this paper is that our learned penalty 
functions significantly differ from the usual penalty functions adopted 
in partial differential equations and energy minimization methods. In 
contrast to their usual robust smoothing properties which is caused by a 
single minimum around zero, most of our learned functions have multiple 
minima different from zero and hence are able to enhance certain image 
structures. See Sec. 4.3 for more information. 


This is a similar configuration to previous works [3], [18]. The 
trained model achieves a test performance of 28.47dB. (A.II) 
with additional tuning of the influence functions, the resulting 
performance is boosted to 28.60dB. (B.I) every diffusion step 
can be different, but only the linear filters are trained with fixed 
infiuence functions. The corresponding model obtains a result of 
28.56dB, which is equivalent to the variational model [10] with the 
same model capacity. Finally (B.II) with additional optimization 
of the infiuence functions, the trained model leads to a significant 
improvement with the result of 28.86dB. 

The analytical experiments demonstrate that without the train¬ 
ing of the infiuence functions, there is no chance to achieve 
significant improvements over previous works, no matter how 
hard we tune the linear filters. Therefore, we believe that the most 
critical factor of our proposed training model lies in the adjustable 
infiuence functions. A closer look at the learned infiuence func¬ 
tions of the TNRDgxg model in Sec.4.2 strengthens our argument. 

Comparing our proposed TNRD model to the CSF model [52], 
one can see that the degree of freedom is in principle the same, 
since in both models the filters and the non-linear functions can 
be learned. Therefore, one would expect a similar performance of 
both models in practice. However, it turns out the performance 
of the CSF model is inferior to our TNRD model in the case of 
Gaussian denoising task. The reason for the performance gap is 
still unclear and we plane to investigate it in future work. 


4.2 Learned influence functions 

A close inspection of the learned 120 penalty functions^ p in the 
TNRDgxg model demonstrated that most of the penalties resemble 

6. The penalty function p{z) is integrated from the influence function (^{z) 
according to the relation (f){z) = p'{z) 


four representative shapes shown in Figure 5. 

(a) Truncated convex penalty functions with low values around 
zero to promote smoothness. 

(b) Negative Mexican hat functions, which have a local minimum 
at zero and two symmetric local maxima. 

(c) Truncated concave functions with smaller values at the two 
tails. 

(d) Double-well functions, which have a local maximum (not a 
minimum any more) at zero and two symmetric local minima. 

At first glance, the learned penalty functions (except (a)) differ 
significantly from the usually adopted penalty functions used in 
PDF and energy minimization methods. However, it turns out that 
they have a clear meaning for image regularization. 

Regarding the penalty function (b), there are two critical points 
(indicated by red triangles). When the magnitude of the filter 
response is relatively small {i.e., less than the critical points), prob¬ 
ably it is stimulated by the noise and therefore the penalty function 
encourages smoothing operation as it has a local minimum at zero. 
However, once the magnitude of the filter response is large enough 
{i.e., across the critical points), the corresponding local patch 
probably contains a real image edge or certain structure. In this 
case, the penalty function encourages to increase the magnitude 
of the filter response, alluding to an image sharpening operation. 
Therefore, the diffusion process controlled by the infiuence func¬ 
tion (b), can adaptively switch between image smoothing (forward 
diffusion) and sharpening (backward diffusion). We find that the 
learned infiuence function (b) is closely similar to an elaborately 
designed function in a previous work [23], which leads to an 
adaptive forward-and-backward diffusion process. 

Similar forms of the learned penalty function in (c) with a 
concave shape are also observed in previous work on image prior 
learning [59]. This penalty function also encourages to sharpen 
the image edges. Concerning the learned penalty function (d), as 
it has local minima at two specific points, it prefers specific image 
structure, implying that it helps to form certain image structure. 
We also find that this penalty function is exactly the type of 
bimodal expert functions for texture synthesis employed in [30]. 

Therefore, our learned penalty functions confirmed existing 
robust penalties based prior models and many priors exploiting 
some unusual penalties, which can produce patterns and enhance 
preferred features. As a consequence, the diffusion process in¬ 
volving the learned infiuence functions does not perform pure 
image smoothing any more for image processing. In contrast, it 
leads to a diffusion process for adaptive image smoothing and 
sharpening, distinguishing itself from previous commonly used 
image regularization techniques. 

4.3 Pattern formation using the learned influence func¬ 
tions 

In the previous work on Gibbs reaction diffusion^ [59], it is 
shown that those unconventional penalty functions such as Figure 
5(c) have significant meaning in visual computation, as they can 
produce patterns. We also find that those unconventional penalty 
functions learned in our models can produce some interesting 
image patterns. 

7. The terminology of “reaction diffusion” in [59] is a bit different from 
ours. In our formulation, “reaction term” is related to the data term, while 
in [59], it means the diffusion term controlled by those downright penalty 
functions. 
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Fig. 6. Well distributed gradients & over stages for the TNRDixs model at the initialization point ©o with a plain setting. One can see that the 
“vanishing gradient” phenomenon [6] in the back-propagation phase of a conventional deep model does not appear in our training model. 



Fig. 7. Patterns synthesized from uniform noise using our learned diffu¬ 
sion models, (a) is generated by (16) using the parameters (linear filters 
and influence functions) in a stage of our learned TNRDixs for image 
denoising, (b) is generated by (16) using the parameters in a stage of 
our learned TNRD^xr for image super resolution and (c) is also from a 
stage of our learned TNRD^xr for image super resolution. 

We consider the following diffusion process involving our 
learned linear filters and the corresponding influence functions 


1 

At 


Nk 

= -Eh 


* {ki * Uf—i) , 


(16) 


where the filters ki and infiuence functions (pi are chosen from 
a certain stage of the learned models. Note that we do not incor¬ 
porate a reaction term in this diffusion model. We run (16) from 
starting points Uq (uniform noise images in the range [0, 255]), and 
it converges to a local minimum^. Some synthesized patterns are 
shown in Figure 7. One can see that the diffusion model with our 
learned infiuence functions and filters can produce edge-like image 
structure and repeated patterns from fully random images. This 
kind of diffusion model is known as Gibbs reaction diffusion in 
[59]. We provide another example in Figure 13 to demonstrate how 
our learned diffusion models can generate meaningful patterns for 
image super resolution. 

4.4 Important aspects of the training framework 

4.4.1 Influence of initialization 

Our training model is also a deep model with many stages (layers), 
but we find that it is not very sensitive to initialization. Based on 
the training for Gaussian denoising, we conducted experiments 
with fully random initializations and some plain settings. 

8. The corresponding diffusion processes are unstable, and therefore we have 
to restrict the image dynamic range to [0, 255]. 


We firstly investigated the case of greedy training where the 
model was trained stage by stage and the parameters of one stage 
were trained at a time. We initialized the parameters using fully 
random numbers in the range [—0.5, 0.5]. It turns out that the 
resulting models with different initializations lead to a deviation 
within 0.0IdB in the test phase. That is to say, the greedy training 
strategy is not sensitive to initialization. 

Then, we considered the case of joint training, where all the 
parameters in all stages were trained at a time. We also initialized 
the training with fully random numbers in the range [—0.5, 0.5]. 
In this case, it turns out that the resulting models lead to inferior 
results, e.g., in the case of TNRD^^^g (28.61 v^.28.78). However, 
plain initializations can generate equivalent results. For example, 
we considered a plain initialization (all stages were initialized 
from the modified DCT filters and an unified infiuence function 
(p^z) = 2z/{l ^ z‘^)), the resulting models performed almost 
the same as those models trained from some good initializations 
such as parameters obtained from greedy training, e.g.,TNRD|>< 5 , 
(28.75 v^.28.78) and TNRD^^^, (28.91 v^.28.92). 

We believe that this appealing property of our training frame¬ 
work is attributed to the well-distributed gradients across stages. 
We show in Figure 6 an example to illustrate the gradients of the 
training loss function with respect to the parameter of all stages. 
One can see that the well-known phenomenon of “vanishing 
gradient” [6] in the back-propagation phase of a usual deep model 
does not appear in our training model. We believe that the reason 
for the well-distributed gradients is that our training model is more 
constrained. For example, in a more general sense, the rotated 
kernel ki in our formulation is not necessary to be the rotated 
version of the kernel ki, and it can be an arbitrary kernel. However, 
we stick to this form, as it has a clear meaning derived from energy 
minimization. 

4.4.2 Influence of the number of training samples 
In our training, we do not consider any regularization for the 
training parameters, and we finally reach good-performing models. 
A probable reason is that we have exploited sufficient training 
samples (400 samples of size 180 x 180). Thus an interesting 
question arises: how many samples are sufficient for our training? 


In order to answer this question, we re-train the TNRD^ 


5x5 


model using different size of training dataset, and then evaluate 
the test performance of trained models. We summarize the results 
in Figure 8. One can see that (1) too few training samples (e.g.. 
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Fig. 8. Influence of the number of training examples for the training 
model TNRD|x 5 



Fig. 9. Influence of the filter size (based on a relatively small training 
data set of 400 images of size 180 x 180) 

s!ss!sieu»En£;as!iin 

(a) 48 filters of size 7 x 7 in stage 1 

s^seiMDBaaBani 

(b) 48 filters of size 7 x 7 in stage 5 

Fig. 10. Trained filters (in the first and last stage) of the TNRD^xr 
model for the noise level a = 25. We can find first, second and higher- 
order derivative filters, as well as rotated derivative filters along different 
directions. These filters are effective for image structure detection, such 
as image edge and texture. 

40 images) will clearly lead to over-fitting, thus inferior test 
performance, and (2) 200 images are typically enough to prevent 
over-fitting. 

4.4.3 Influence of the filter size 

In our model, the size of involved filters is a free parameter. In 
principle, we can exploit filters of any size, but in practice, we 
need to consider the trade-off between run time and accuracy. 


In order to investigate the influence of the filter size, we 
increase the filter size to 7 x 7 and 9x9. We find that increasing the 
filter size from 5 x 5 to 7 x 7 brings a significant improvement of 
0.14dB ( TNRDy><Y v^y.TNRDl^^g) as show in Table 1. However, 
when we further increase the filter size to 9 x 9, the resulting 
TNRDg^g only leads to a performance of 28.96dB (a slight 
improvement of 0.05dB relative to the TNRDy^^y model). We can 
conjecture that further increasing the filter size to 11 x 11 might 
bring negligible improvements. 

Note that the above conclusion is drawn from a relatively 
small training data set of 400 images of size 180 x 180. It 
should be mentioned that when the size of the model increase, 
the size of training data set should also increase to avoid over¬ 
fitting. However, our current CPU implementation for training 
prevents us from training with larger model and large-scale data 
sets (millions). A faster implementation on GPUs together with the 
stochastic gradient descent optimization strategy is left to future 
work. 

We also consider a model with smaller filters, 3x3. We 
summarize the results of different model capacities in Figure 9. 
In practice, we prefer the TNRD^xy model as it provides the best 
trade-off between performance and computation time. Therefore, 
in later applications, we only consider TNRD^^y models. 

Fig. 10 shows the trained filters of the TNRDy^y model in 
the first and last stage for the task of Gaussian denoising. One 
can find many edge and image structure detection filters along 
different directions and in different scales. 

4.5 Training for different noise ieveis and comparison 
to recent state-of-the-arts 

The above training experiments are based on Gaussian noise of 
level a = 25. We also trained diffusion models for the noise 
levels cr = 15 and cr = 50. The test performance is summarized 
in Table 1, together with comparison to very recent state-of-the- 
art denoising algorithms. In experiments, we observed that joint 
training can always gain an improvement of about 0.1 dB over the 
greedy training for the cases of T > 5. 

From Table 1, one can see that for all noise levels, the re¬ 
sulting TNRDyx? model achieves the highest average PSNR. The 
TNRDy><y model outperforms the benchmark - BM3D method by 
0.35dB in average. This is a notable improvement as few methods 
can surpass BM3D more than 0.3dB in average [36]. Moreover, 
the TNRDy^y model also surpasses the best-reported algorithm - 
WNNM method, which is quite slow as shown in Table 2. 

4.6 Run time 

The algorithm structure of our TNRD model is similar to the 
CSF model, which is well-suited for parallel computation on 
GPUs. We implemented our trained models on GPU using CUBA 
programming to speed up the inference procedure, and finally 
it leads to significantly improved performance, see Table 2. We 
make a run time comparison to other denoising algorithms based 
on strictly enforced single-threaded CPU computation ( e.g., start 
Matlab with -singleCompThread) for a fair comparison, see Table 
2. We only present the results of some selective algorithms, which 
either have the best denoising result or run time performance. We 
refer to [52] for a comprehensive run time comparison of various 
algorithms^. 

9. LSSC, EPLL, opt-MRF and RTF^ methods are much slower than BM3D 
on the CPU, cf. [52]. 
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Method 


a 

St 

TNRDsxs 

TNRD 7 X 7 

15 

25 

50 

a = 

15 

BM3D 

31.08 

28.56 

25.62 2 

31.14 

31.30 

LSSC 

31.27 

28.70 

25.72 5 

31.30 

31.42 

EPLL-GMM 

31.19 

28.68 

25.67 8 

31.34 

31.43 

opt-MRF-j.><-j. 

31.18 

28.66 

25.70 

a = 

25 

RTF® 

- 

28.75 

2 

28.58 

28.77 

WNNM 

31.37 

28.83 

25.83 5 

28.78 

28.92 

CSFi^s 

31.14 

28.60 

8 

28.83 

28.95 

CSF5^7 

31.24 

28.72 

- 

(7 = 

50 

ARF|x5 

30.70 

28.20 

- ~2^ 

25.54 

25.78 

opt-GDj°5 

- 

28.39 

5 

25.80 

25.96 




8 

25.87 

26.01 






TABLE 1 

Average PSNR (dB) on 68 images from [48] for image denoising with 
fj = 15,25 and 50. All the TNRD models are jointly trained. Note that 
among those algorithms similar to our model, opt-MRE^^r, ARFI^s 
and opt-GDgx 5 only train the filters with fixed penalty function 
log(l + z^). In the opt-MRE^^r model, 48 filters of size 7 x 7 (2304 
free parameters), for ARFl^s, 13 filters of size 5 x 5 (325 free 
parameters) and for the opt-GDgx 5 algorithm, 24 filters of size 5x5 
(600 free parameters) are trained. The CSF model and our approach 
train both the filters and nonlinearities, thus involving more parameters, 
e.g., the TNRD^xr model involves 26,645 free parameters and the 
corresponding the CSF^^r model involves 24,245 free parameters. 


Method 

256^ 

512^ 

1024^ 

2048^ 

3072^ 

BM3D [15] 

1.1 

4.0 

17 

76.4 

176.0 

CSF|,<7 [52] 

3.27 

11.6 

40.82 

151.2 

494.8 

WNNM [26] 

122.9 

532.9 

2094.6 

- 

- 


0.51 

1.53 

5.48 

24.97 

53.3 

TNRDi^s 

0.43 

0.78 

2.25 

8.01 

21.6 

0.005 

0.015 

0.054 

0.18 

0.39 


1.21 

3.72 

14.0 

62.2 

135.9 

TNRD5^, 

0.56 

1.17 

3.64 

13.01 

30.1 

0.01 

0.032 

0.116 

0.40 

0.87 


TABLE 2 

Run time comparison for image denoising (in seconds) with different 
implementations. (1) The run time results with gray background are 
evaluated with the single-threaded implementation on Intel(R) Xeon(R) 
CPU E5-2680 v 2 @ 2.80GHz; ( 2 ) the blue colored run times are 
obtained with multi-threaded computation using Matlab parfor on the 
above CPUs; (3) the run time results colored in red are executed on a 
NVIDIA GeForce GTX 780Ti GPU. We do not count the memory 
transfer time between CPU/GPU for the GPU implementation (if 
counted, the run time will nearly double) 

We see that our TNRD model is generally faster than the CSF 
model with the same model capacity. It is reasonable, because 
in each stage the CSF model involves additional DFT and inverse 
DTF operations, i.e., our model only requires a portion of the com¬ 
putation of the CSF model. Even though the BM3D is a non-local 
model, it still possesses high computational efficiency. In contrast, 
another non-local model - WNNM achieves compelling denoising 
results at the expense of huge computation time. Moreover, the 
WNNM algorithm is hardly applicable for high resolution images 
(e.g., 10 mega-pixels) due to its huge memory requirements. Note 
that our model can be also easily implemented with multi-threaded 
CPU computation. 

In summary, our TNRDy^^y model outperforms these recent 
state-of-the-arts, meanwhile it is the fastest method even with a 
CPU implementation. We present an illustrative denoising exam¬ 
ple in Figure 11 on an image from the test dataset. More denoising 
examples can be found in the supplemental material based on 
images from the test dataset and a megapixel-size natural image 
of size 1050 x 1680. 


5 Single image super resolution (SISR) 

As demonstrated in the last section that our trained diffusion model 
can lead to explicit backward diffusion process, which sharpens 
image structures like edges. This is the very property demanded 
for the task of image super resolution. Therefore, we are motivated 
to investigate the SISR problem with our proposed approach. 

We start with the following energy functional 

Nu ^ 

mmE{u) = '^pi{ki*u) + ip\Au - f\\l , ( 17 ) 

i=l 

where the linear operator A is a bicubic interpolation which links 
the high resolution (HR) image h to the low resolution (LR) image 
/ via / = Ah. Casting ^{u) = f — /jH and Q(u) = 0, the 
energy functional (17) suggests the following diffusion process 

Nk _ \ 

M * 4(4 * tit-i) + x^A^iAut., - /) j , 

( 18 ) 

where the starting point Uq is given by the direct bicubic inter¬ 
polation of the LR image /. Computing the gradients and 
with respect to (18) can be done with little modifications 
to the derivations for image denoising. Detailed derivations are 
presented in the supplemental material. 

We considered the model capacity of TNRD^^^, and trained 
diffusion models for three upscaling factors x 2, x 3 and x4, using 
exactly the same 91 training images as in previous works [54], 
[53]. The trained models are evaluated on two different test data 
sets: Sets and Setl4. Following previous works [54], [19], [53], 
the trained models are only applied to the luminance component 
of an image, and a regular bicubic upscaling method is applied to 
the color components. 

The test results are summarized in Table 3 and Table 4. One 
can see that in terms of average PSNR, our trained diffusion model 
TNRDy^y leads to significant improvements over very recent 
state-of-the-arts in all cases, meanwhile it is still among the fast 
algorithms'^. A SISR example is shown in Figure 12 to illustrate 
its effectiveness. One can see that our approach can obtain more 
accurate image edges, as shown in the zoom-in parts. More SISR 
examples can be found in the supplemental material. 

We apply the learned diffusion parameters to the diffusion 
equation (16). It turns out that the diffusion process can also 
generate some interesting patterns from random images, as shown 
in Figure 7. We believe that this ability to generate image patterns 
from weak evidence is the main reason for the superiority of 
our trained model for the SISR task. In order to further validate 
our argument, we carry out a toy SISR experiment based on a 
synthesized image with repeated hexagons. The results are shown 
in Figure 13, where one can see that our trained model can better 
reconstruct those repeated image structures. 

6 JPEG DEBLOCKING EXPERIMENTS 

In order to further demonstrate the applicability of our proposed 
framework for those problems with a non-smooth data term, we 
investigate the JPEG deblocking problem - suppressing the block 
artifacts in the JPEG compressed images, which is formulated as a 

10. Note that our approach is a Matlab implementation, while some of other 
algorithms are based on C-i-i- implementations, such as SR-CNN. 
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(a) Noisy, 20.17dB 


(d) WNNM, 27.94dB/CPU: 393.2s 


(b) BM3D, 27.53dB/CPU: 2.5s (c) CSF^^^, 28.00dB/GPU: 0.55s 



(e) TNRDf^g, 28.16dB/GPU: 9.1ms (f) TNRD^^^^, 28.23dB/GPU: 20.3ms 




Fig. 11. Denoising results on a test image of size 481 x 321 {a = 25) by different methods (compared with BM3D [15], WNNM [26] and CSF model 
[52]), together with the corresponding computation time either on CPU or GPU. Note the differences in the highlighted region. 


Set5 


Bicubic 

K-SVD [58] 

ANR [54] 

SR-CNN [19] 

RFL [53] 

TNRD^><7 

images 

scale 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

baby 

2 

37.07 

- 

38.25 

8.21 

38.44 

1.39 

38.30 

0.38 

38.39 

1.31 

38.51 

1.52 

bird 

2 

36.81 

- 

39.93 

2.67 

40.04 

0.44 

40.64 

0.14 

40.99 

0.52 

41.29 

0.59 

butterfly 

2 

27.43 

- 

30.65 

2.14 

30.48 

0.38 

32.20 

0.10 

32.46 

0.41 

33.16 

0.56 

head 

2 

34.86 

- 

35.59 

2.46 

35.66 

0.41 

35.64 

0.13 

35.70 

0.48 

35.71 

0.60 

woman 

2 

32.14 

- 

34.49 

2.45 

34.55 

0.43 

34.94 

0.13 

35.19 

0.46 

35.50 

0.57 

average 

2 

33.66 

- 

35.78 

3.59 

35.83 

0.61 

36.34 

0.18 

36.55 

0.64 

36.83 

0.77 

baby 

3 

33.91 

- 

35.08 

3.77 

35.13 

0.79 

35.01 

0.38 

35.04 

0.79 

35.28 

1.52 

bird 

3 

32.58 

- 

34.57 

1.34 

34.60 

0.27 

34.91 

0.14 

35.15 

0.31 

36.11 

0.59 

butterfly 

3 

24.04 

- 

25.94 

1.08 

25.90 

0.24 

27.58 

0.10 

27.18 

0.25 

28.90 

0.56 

head 

3 

32.88 

- 

33.56 

1.35 

33.63 

0.24 

33.55 

0.13 

33.68 

0.29 

33.78 

0.60 

woman 

3 

28.56 

- 

30.37 

1.14 

30.33 

0.24 

30.92 

0.13 

30.92 

0.28 

31.77 

0.57 

average 

3 

30.39 

- 

31.90 

1.74 

31.92 

0.35 

32.39 

0.18 

32.39 

0.39 

33.17 

0.77 

baby 

4 

31.78 

- 

33.06 

2.63 

33.03 

0.59 

32.98 

0.38 

33.05 

0.60 

33.29 

1.52 

bird 

4 

30.18 

- 

31.71 

0.70 

31.82 

0.18 

31.98 

0.14 

32.14 

0.23 

32.98 

0.59 

butterfly 

4 

22.10 

- 

23.57 

0.54 

23.52 

0.14 

25.07 

0.10 

24.44 

0.19 

26.22 

0.56 

head 

4 

31.59 

- 

32.21 

0.66 

32.27 

0.16 

32.19 

0.13 

32.31 

0.22 

32.57 

0.60 

woman 

4 

26.46 

- 

27.89 

0.72 

27.80 

0.23 

28.21 

0.13 

28.31 

0.23 

29.17 

0.57 

average 

4 

28.42 

- 

29.69 

1.05 

29.69 

0.26 

30.09 

0.18 

30.05 

0.29 

30.85 

0.77 


TABLE 3 

PSNR (dB) and run time (s) performance for upscaling factors x2, x3 and x4 on the Set5 dataset. All the methods use the same 91 training 

images as in [54]. 


non-smooth optimization problem. Motivated by [8], we consider 
the following variational model based on the FoE image prior 

Nk 

m.mE{u) = '^pi{ki*u) +Iq{Du) , (19) 

i=l 

where Iq is a indicator function over the set Q (quantization 
constraint set). In JPEG compression, information loss happens in 
the quantization step, where all possible values in the range [d — 
0.5, (i + 0.5] ((i is an integer) are quantized to a single number d. 
Given a compressed data, we only know d. Therefore, all possible 
values in the interval [d — 0.5, (i + 0.5] define a convex set Q 


which is a box constraint. The sparse matrix D G denotes 

the block DCT transform. We refer to [8] for more details. 

By setting 'D{u) = 0 and G{u) = Xq{Du), we obtain the 
following diffusion process 

Ut = -D"^projQ (d fut-i - ki * (plikj * Wt-i))) , 

( 20 ) 

where projQ(-) denotes the orthogonal projection onto Q. More 
details can be found in the supplemental material. 

We also trained diffusion models for the JPEG deblocking 
problem. We followed the test procedure in [33] for perfor¬ 
mance evaluation. We distorted the images by JPEG block- 
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Setl4 

images 

Bicubic 

K-SVD [58] 

ANR [04] 

SR-CNN [19] 

RFL [53] 

TNRD?^, 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

baboon 

23.21 

- 

23.52 

3.54 

23.56 

0.77 

23.60 

0.40 

23.57 

0.75 

23.62 

1.30 

barbara 

26.25 

- 

26.76 

6.24 

26.69 

1.23 

26.66 

0.70 

26.63 

1.18 

26.25 

1.75 

bridge 

24.40 

- 

25.02 

3.98 

25.01 

0.80 

25.07 

0.44 

25.11 

0.81 

25.29 

1.19 

coastguard 

26.55 

- 

27.15 

1.54 

27.08 

0.36 

27.20 

0.17 

27.16 

0.35 

27.12 

0.65 

comic 

23.12 

- 

23.96 

1.37 

24.04 

0.27 

24.39 

0.15 

24.27 

0.34 

24.67 

0.65 

face 

32.82 

- 

33.53 

1.10 

33.62 

0.24 

33.58 

0.13 

33.65 

0.29 

33.82 

0.57 

flowers 

27.23 

- 

28.43 

2.66 

28.49 

0.57 

28.97 

0.30 

28.86 

0.61 

29.55 

0.90 

foreman 

31.18 

- 

33.19 

1.54 

33.23 

0.30 

33.35 

0.17 

33.87 

0.36 

34.65 

0.65 

lenna 

31.68 

- 

33.00 

3.89 

33.08 

0.79 

33.39 

0.44 

33.38 

0.77 

33.77 

1.19 

man 

27.01 

- 

27.90 

3.81 

27.92 

0.76 

28.18 

0.44 

28.20 

0.80 

28.52 

1.17 

monarch 

29.43 

- 

31.10 

6.13 

31.09 

1.13 

32.39 

0.66 

32.10 

1.12 

33.61 

1.66 

pepper 

32.39 

- 

34.07 

3.84 

33.82 

0.80 

34.35 

0.44 

34.55 

0.82 

35.06 

1.20 

ppt3 

23.71 

- 

25.23 

4.53 

25.03 

1.01 

26.02 

0.58 

25.84 

0.98 

27.08 

1.48 

zebra 

26.63 

- 

28.49 

3.36 

28.43 

0.69 

28.87 

0.38 

29.03 

0.72 

29.40 

1.04 

average performance 

27.54 

- 

28.67 

3.40 

28.65 

0.69 

29.00 

0.39 

29.02 

0.71 

29.46 

1.10 


TABLE 4 

Upscaling factor x3 performance in terms of PSNR(clB) and runtime (s) per image on the Set14 dataset. 



(d) SR-CNN [19] / 32.39dB (e) RFL [53] / 32.10dB (f) TNRD ^><7 / 33.61dB 

Fig. 12. A super resolution example for the “Monarch” image from Set14 with an upscaling factor x3. Note the differences in the highlighted region 
that our model achieves more clean and sharp image edges. Best viewed on screen and zoom in. 




SsSsSs*sSs*s*sSs«sS: 







(a) Original (b) Bicubic / 14.18dB (c) ANR / 15.07dB (d) SR-CNN / 15.65dB (e) RFL / 15.03dB (f) TNRD^^^^ / 17.22dB 


Fig. 13. A toy experiment on a synthesized image with repeated hexagons for the upscaling factor x3. 


ing artifacts. We considered three compression quality settings 
q = 10, 20 and 30 for the JPEG encoder. 

We trained three nonlinear diffusion TNRDyx? models for dif¬ 
ferent compression parameter q. We found that for JPEG deblock¬ 
ing, 4 stages are already enough. Results of the trained models are 
shown in Table 5, compared with several representative deblocking 
approaches. We see that our trained ENRO^^y outperforms all 
the competing approaches in terms of PSNR. Concerning the 


run time, our model takes about 11.2s to handle an image of 
size 1024 x 1024 with CPU computation, while the strongest 
competitor (in terms of run time) - SADCT consumes about 
56.5s^^ Eurthermore, our model is extremely fast on GPUs. Eor 
the same image size the GPU implementation takes about 0.095s. 
See the supplemental material for JPEG deblocking examples. 

11. RTF is slower than SADCT, as it depends on the output of SADCT 































































IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE , VOL. XX, NO. XX, 2016 


14 


q 

JPEG 

decoder 

TGV 

[8] 

Dic- 

SR[9] 

SADCT 

[22] 

RTF[33] 

TNRDIxt 

10 

26.59 

26.96 

27.15 

27.43 

27.68 

27.85 

20 

28.77 

29.01 

29.03 

29.46 

29.83 

30.06 

30 

30.05 

30.25 

30.13 

30.67 

31.14 

31.41 


TABLE 5 

JPEG deblocking results for natural images, reported with average 
PSNR values. 


7 Discussion, summary and future work 

7.1 Summary 

We have proposed a trainable nonlinear reaction diffusion frame¬ 
work for effective image restoration. Its critical point lies in the 
additional training of the influence functions. We have trained our 
models for the problem of Gaussian denoising, single image super 
resolution and JPEG deblocking. Based on standard test datasets, 
the trained models result in the best-reported results. We believe 
that the effectiveness of the trained diffusion models is attributed 
to the following desired properties of the models 

• Anisotropy. In the trained Alters, we can And rotated deriva¬ 
tive Alters in different directions, cf. Fig 10, which will make 
the diffusion happen in some special directions. 

• Higher order. The learned Alters contain Arst, second and 
higher-order derivative Alters, cf. Fig 10. 

• Adaptive forward/backward diffusion through the learned 
nonlinear functions. Nonlinear functions corresponding to 
explicit backward diffusion appear in the learned nonlinear- 
ity, cf. Fig 5. 

Meanwhile, the structure of trained models is very simple and 
well-suited for parallel computation on GPUs. As a consequence, 
the resulting algorithms are signiAcantly faster than all competing 
algorithms and hence are also applicable to the restoration of high 
resolution images. 

7.2 Discussion 

One possible limitation of the proposed TNRD approach is that 
one has to deAne the ground truth - the expected output of the dif¬ 
fusion network during training. For image restoration applications 
in this paper, this is not a problem as we have a clear choice for 
the ground truth. However, for those applications with ambiguous 
ground truth, e.g., image structure extraction [57], we will have to 
make efforts to deAne the ground truth. 

Furthermore, the trained diffusion networks will only perform 
well in the way they are trained. For example, the trained model 
based on noise level a = 25 will break for an input image with 
noise cr = 50, and the trained model for upscaling factor x3 
will also lead to inferior performance when it is applied to the 
SISR problem of upscaling factor x2. It is generally hard to train 
a universal diffusion model to handle all the noise levels or all 
upscaling factors. 

Our approach is to optimize a time-discrete PDF, which is 
inspired by FoE based model, but we do not aim to minimize a 
series of FoE based energies. Our model directly learns an optimal 
trajectory for a certain possibly unknown energy functional, the 
minimizer of which provides a good estimate of the demanded 
solution. Probably, such a functional can not be modeled by a 
single FoE energy, while our learned gradient descent/forward- 
backward steps provide good approximation to the local gradients 
of this unknown functional. 


7.3 Future work 

From an application point of view, we think that it will be interest¬ 
ing to consider learned nonlinear reaction diffusion based models 
also for other image processing tasks such as image inpainting, 
blind image deconvolution, optical flow. Moreover, since learning 
the influence functions turned out to be crucial, we believe that 
learning optimal nonlinearities (i.e., activation functions) in stan¬ 
dard CNs could lead to a similar performance increase. There are 
actually two recent works [2], [29] to investigate a parameterized 
ReFU function in standard deep convolutional networks, which 
indeed brings improvements even with little freedom to tune 
the activation functions. Finally, it will also be interesting to 
investigate the unconventional penalty functions learned by our 
approach in usual energy minimization approaches. 
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